Notebook

These are some notes I'm making while experimenting with scaling the indexer.

Scaling problems¶

We ran with 10,000 W/ARCs, but got some troubling timings. Having tweaked that task number, 5 allowed with 1GB RAM each (TODO Add config details), we have a reasonably fast map phase, taking about two hours to process all 10,000 (and so implying up to 90 hours to process them all, but keep in mind that the total is only as fast as the slowest jobs, i.e. the big WARC files dominate at smaller job sizes, and there was some competition for cluster time). The first time the JISC 1996-2010 collection was indexed, it only required about a soild day's worth of processing time, i.e. about 26 hours.

| # Inputs | # Reducers | Finish Time (Map)   | Finish Time (Map+Reduce) |
--------------------------------------------------------------------------
|    100   | 10         | 8mins, 41sec        | 8mins, 56sec             | 4.4.0
|  1,000   | 10         | 31mins, 3sec        | 1hrs, 24mins, 19sec      | 4.6.1
| 10,000   | 10         | 2hrs, 45mins, 21sec | 21hrs, 25mins, 56sec     | 4.4.0

However, the 10 reducers are failing. They run twice, the first time crashing out with:

Error: java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:282)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:199)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:104)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.mapred.IFileOutputStream.write(IFileOutputStream.java:84)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:49)
    at java.io.DataOutputStream.write(DataOutputStream.java:90)
    at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:226)
    at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:157)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2699)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2640)

then running ok the second time, presumably because there is only enough disk space on some of the machines. Others have hit this problem, and this indicates that we might be able to sort things out by clearing up the temporary space that Hadoop is configured to use (TODO add config param info). However, it also implies we are likely to hit an upper limit on the size of job we can process due to the limited about of temp space we have. Note that this should be nothing to do with HDFS free space, because the system temp space is usually held on a different drive to the DFS volumes.

| # Inputs | # Reducers | Shuffle Time         | Sort Time     | Total Reduce Time   |
--------------------------------------------------------------------------------------
|    100   | 10         | 5mins, 21sec         | 30sec         | 8mins, 51sec        |
|  1,000   | 10         | 49mins, 9sec         | 23sec         | 1hrs, 18mins, 33sec | 4.6.1
| 10,000   | 10         | 10hrs, 41mins, 35sec | 0sec          | 10hrs, 42mins, 9sec |
|          |            | 1hrs, 19mins, 15sec  | 43mins, 12sec | 8hrs, 36mins, 27sec |

Similarly, it's worth noting that even when it works, the sort is taking 7-10 hours (which is why it takes over 20 hours when it fails once). Somewhat oddly, every single reducer failed on disk space the first time around, i.e. roughtly simultanously, and then worked (no clear correlation between failed nodes the first time, i.e. the same rack etc.). That implies that some kind of temp-space job contention might be the issue.

Note that there is a lot of lines like this:

ERROR WARCIndexerReducer - No appropriate response record found for: sha1:223JBF7A4BH6TNGCAI2MIPGWOPBNJLBB_http://news.bbcimg.co.uk/media/images/48244000/jpg/_48244565_lorenzo_reuters226i.jpg (revisit)

This is a consequence of the fact that that the small sample means the deduplication strategy is failing. Some WARCs are mostly revisits.

TODO What is the ARC/WARC composition of the 10,000? ARC 5593 WARC 4407

ARC 442703 WARC 4494

https://issues.apache.org/jira/browse/SOLR-4816 means we are suffering on indexing throughput.

https://wiki.apache.org/solr/SolrCloud

Also, it seems we are putting too much pressure on the sort now. Perhaps partly due to the link extraction and partly due to the higher binary limit allowing more resources to use up more of the 1MB text field size limit.

Deduplication strategies¶

We are using the reduce step as out deduplication strategy. Items with the same URL and content hash are grouped together, and only a single SOLR record is submitted for each one.

To resolve this, we had to properly calculate the hash of the ARCs and allow for multiple crawl dates, and query Solr during the map to decide whether to send an update to the crawl_dates or not.

Rebuilding the indexer¶

So, the indexer has been rebuilt.

Uses new duplicate handling logic.
Requests compression of the map output.
Face detection, colour extraction.
...

All 'expensive' features are switched on.

Now we need new timings. Started with ten inputs, but a new ten, so numbers will not be directly comparable.

For 10, Total time: 00:21:52. Map, Worst case 00:20:00, most most around 00:08:00. Reduce time, 18 mins but this includes the long running shuffle and sort while awaiting the slowest map. Actual reduce action time was 30-40 seconds.

On 100, hit problems with empty/malformed payloads that killed the job. Fixing this and re-launching.

We are getting DEBUG output from org.apache.zookeeper and it's not clear why.

Rather slow to warm up and get going. Hopefully this is mapper initialisation and we'll pick up speed shortly. OK Looks like SOLR crashed, and the clients are waiting for it. It was an OOM, but actually 'unable to create new native thread', which is a ulimit thing. Need up up the ulimits for the tomcat user and restart the cluster.

Ok, re-running with SOLR rebooted.

For 100: Total time: 00:43:21 Maps, 6-20 mins per input. Reducers, 38 mins overall, but actual submission to Solr only about 2 minutes. (Seems much faster.) This would mean 20 weeks! Need timings from 1000 to confirm relilability of this estimate.

For 1000: Some contention, other indexing jobs running at the same time. Now ArchiveCDXGenerator and sorter jobs kicked in, all competing for map time. Total time: 15:43:40 Maps: Most 1-2 hrs, worst case was around 11 hours! Reduce phase took about two hours, but was overlapping with another job in the reduce phase.

Files taking many hours (>8 hours) to map Processing path: hdfs://nellie-private:54310/ia/PHASE2WARCS/DOTUK-HISTORICAL-1996-2010-PHASE2WARCS-XAAAAZ-20111115000000-000000.warc.gz

Speeding things up¶

To speed things up, we can go to the other extreme and try switching off lots of features.

Disabling both Image and PDF analysis.
- Local test: 77.89 seconds -> 45.185 seconds.
Disabling PDF analysis.
- Local test: 77.89 seconds -> 74.764 seconds.
Disabling Image analysis.
- Local test: 77.89 seconds -> 51.093 seconds.
Up the limit on in-memory content processing from 1MB to 10MB:
- Local test: 81.992 seconds -> 77.89 seconds.
Dropping maximum text to extract from 1024K to 1K:
- Local test: 45.185 seconds -> 42.631 seconds.
Dropping maximum bytes to allow Tika to parse from ALL to 1K:
- Local test: 64.982 seconds -> 53.949 seconds.

So, dropping the image analysis made a large difference. Given the pressures involved right now, it probably makes more sense to disable these features (which are of relatively little interest to the main BUDDAH researchers right now).

Rerunning with image and PDF features turned off.

For 1000: Note some contention with previous job, but mostly in the reducer phase. Total time: 10:27:01. Mappers better behaved, worst case now 04:20:13. Reduce phase approx 3.5 hours, but heavy contention with other reduces from previous job.

4hr worst case map: hdfs://nellie-private:54310/ia/PHASE2WARCS/DOTUK-HISTORICAL-1996-2010-PHASE2WARCS-XAABLX-20111115000000-000000.warc.gz

Also bad: hdfs://nellie-private:54310/ia/PHASE2WARCS/DOTUK-HISTORICAL-1996-2010-PHASE2WARCS-XAAANR-20111115000000-000001.warc.gz hdfs://nellie-private:54310/ia/PHASE2WARCS/DOTUK-HISTORICAL-1996-2010-PHASE2WARCS-XAAAAZ-20111115000000-000000.warc.gz

So, setting up a local test with the worst file (XAAAAZ-20111115000000-000000, which is only 0.6GB, so it's probably not raw size that's the problem). Hang on, that's a bad idea, cos it's probably take 4 hours! Adding logging to see where it gets stuck... Oh dear. It worked fine. "Finished in 2791.518 seconds."

Running again, but excluding most formats from Tika processing (as we usually do): 2749.475 seconds! Didn't beleve it, so disabled the excludes (i.e. all in) once more: 2018.007 seconds!?

Ok, so cleaning up code and disabling Tika for problematic types, and rerunning on the 100 with no contention gives... Total time: 00:32:07 Mappers roughly 20-30mins. Reducers e.g. 00:01:22 i.e. around a minute.

Trying again with the 1000, although with some contention and HDFS is extremely full which may be causing problems... Total time:

Other things to try:

Disable recursive parsing in Tika.
ONLY hash the first X bytes, and use that for dedup.

100, 30 mins?

Okay, so back on the cluster, with the solr check switched off (relying on updates instead of managing that myself and querying for every resources), and WHOA that's better. The shuf-100 job runs in about 10mins (instead of 30mins) Total time: Mappers: 10 mins. Reducers: FAILED

Still pretty slow on the reduce. 35 might be hammering it, but still. Maybe need to try cutting down on some fields, e.g. href indexing, possibly even worth ignoring host-level links and just do domain-level.

Ah, no, my fault. It's the new crawl_years field. Nice work Jackson. So, taken that out.

Ran again with 100 inputs. Mapper nice and quick, and 35 reducers coped this time. Total time: 00:17:52 Mappers: c. 8 mins. Reducers: c. 10 mins.

Ran with 1000 inputs, mappers quite quick, but reducers kept dying. Dropping the number of reducers to 20, and it seems stable.

Total time: > 01:57:00 Mappers: c. 30 mins. Reducers: lots more time: > 1hr. Hmmm, after about 94.33% of the reduce phase, Solr is locking up. Kill the job and it slowly recovers, which is good.

So, things to try next: switch off link analysis. Trying it locally, on the XAAAAZ-20111115000000-000000 test file. Finished in 3800.776 seconds. Hmmm. Totes inconclusive due to variability of timings on the laptop.

So, on the cluster, and without links, for the 1000: After ten minutes, 85% done! After 15 minutes, 99% done! After 20 mins, 99.95% done! After 25 mins, 100% done! (only a couple of WARCs at the end, so more to be gained by running more inputs at once) After c.28 mins, sort is also complete. Reducers stuck after nearlu eight hours! c.94% complete. KILLING

So, trying dropping reducers to 10. Still dying, but maybe Solr is very grumpy. Restarting Solr. So, after 01:51:01 it is done. Mappers are quick, index still slow. i.e. about 30-odd mins mappers, about a hour and a quarter indexing.

Oh, links were still switched on!? Trying again without the links... Emptying SOLR...

Locally, switching off multiple features to see how it changes things. no host links, no binary shingling, no 'elements used': Finished in 2399.014 seconds. Hm. Ok, also dropping text payload right down to 10K: Finished in 2657.414 seconds.

So, back on the cluster, with the links off and an empty target Solr: Total time: 01:06:13 i.e. nearly half the time.

Also disabled first_bytes, but left data in Solr: Total time: 01:26:03 Slightly longer, probably because it was doing updates not just replacements.

So, cleared the data out, and also dropped the text payload size down to 10KB: Total time: 01:03:54 Nice.

Finally, knocking down the number of reducers to 5, to see if that makes much difference. On an empty Solr. Yes, it did run a bit slower, but not massively. Total time: 01:27:06

Now running on 10,000, leaving reducers at 5... Map time, about two hours! Reduce Copy kinda slow: reduce > copy (2215 of 10000 at 7.75 MB/s) CRASH with only 5 reducers we run out of disk space...

Upping back to 10 reducers.

Try 15? First, lets try to see why there's so much data.

Dropped the elements_used, in case there was some weirdness there. Was the same (crashed out of disk space during shuffle).

Dropping the hosts, in case it's those cheeky link farms that are to blame.

Little difference. Trying dropping text load.

NOTE Looked in

/mapred/local/dir/taskTracker/anjackson/jobcache/job_201402191107_1551/attempt_201402191107_1551_r_000004_0/output

and the output is clearly NOT compressed. And some are BIG, and look like link-farm mess.

Even after reducing the text load to 50K, this uncompressed data still failed on some reducers. Eventually some got through, only to cripple the Solr server (still just 10 reducers). Those that got to Solr failed like this.

2014-03-30 16:29:26 INFO  WARCIndexerReducer:111 - Submitted 500 docs [0]
2014-03-30 16:30:30 ERROR WARCIndexerReducer:116 - WARCIndexerReducer.reduce(): No live SolrServers available to handle this request:[http://192.168.1.180:8994/solr/jisc3]
org.apache.solr.client.solrj.impl.CloudSolrServer$RouteException: No live SolrServers available to handle this request:[http://192.168.1.180:8994/solr/jisc3]
    at org.apache.solr.client.solrj.impl.CloudSolrServer.directUpdate(CloudSolrServer.java:351)
    at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:510)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
    at uk.bl.wa.solr.SolrWebServer.add(SolrWebServer.java:106)
    at uk.bl.wa.hadoop.indexer.WARCIndexerReducer.checkSubmission(WARCIndexerReducer.java:110)
    at uk.bl.wa.hadoop.indexer.WARCIndexerReducer.reduce(WARCIndexerReducer.java:84)
    at uk.bl.wa.hadoop.indexer.WARCIndexerReducer.reduce(WARCIndexerReducer.java:29)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:469)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[http://192.168.1.180:8994/solr/jisc3]
    at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:354)
    at org.apache.solr.client.solrj.impl.CloudSolrServer$1.call(CloudSolrServer.java:332)
    at org.apache.solr.client.solrj.impl.CloudSolrServer$1.call(CloudSolrServer.java:329)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Cannot talk to ZooKeeper - Updates are disabled.
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:495)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
    at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:285)
    ... 7 more
2014-03-30 16:39:49 ERROR WARCIndexerReducer:116 - WARCIndexerReducer.reduce(): No live SolrServers available to handle this request:[http://192.168.1.180:8988/solr/jisc3, http://192.168.1.180:8983/solr/jisc3, http://192.168.1.180:8996/solr/jisc3, http://192.168.1.180:8994/solr/jisc3]
org.apache.solr.client.solrj.impl.CloudSolrServer$RouteException: No live SolrServers available to handle this request:[http://192.168.1.180:8988/solr/jisc3, http://192.168.1.180:8983/solr/jisc3, http://192.168.1.180:8996/solr/jisc3, http://192.168.1.180:8994/solr/jisc3]
    at org.apache.solr.client.solrj.impl.CloudSolrServer.directUpdate(CloudSolrServer.java:351)
    at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:510)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
    at uk.bl.wa.solr.SolrWebServer.add(SolrWebServer.java:106)
    at uk.bl.wa.hadoop.indexer.WARCIndexerReducer.checkSubmission(WARCIndexerReducer.java:110)
    at uk.bl.wa.hadoop.indexer.WARCIndexerReducer.reduce(WARCIndexerReducer.java:84)
    at uk.bl.wa.hadoop.indexer.WARCIndexerReducer.reduce(WARCIndexerReducer.java:29)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:469)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[http://192.168.1.180:8988/solr/jisc3, http://192.168.1.180:8983/solr/jisc3, http://192.168.1.180:8996/solr/jisc3, http://192.168.1.180:8994/solr/jisc3]
    at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:354)
    at org.apache.solr.client.solrj.impl.CloudSolrServer$1.call(CloudSolrServer.java:332)
    at org.apache.solr.client.solrj.impl.CloudSolrServer$1.call(CloudSolrServer.java:329)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Cannot talk to ZooKeeper - Updates are disabled.
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:495)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
    at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:285)
    ... 7 more

So, it seems the issue is simply that of scale. The JISC collection has very large numbers of resources per input file, and this stretches the size of the mapper outputs to the system's limits.

Trying one more run, switching on the PDF checker, and enabling compression, to see how it goes... That seemed to work! Output/temp files look compressed. Also, PDF Preflight tests appeared to add a negligible amount of processing (all done at 02:18:00). Shuffle & Merge seemed to work ok, although some disk space grumbling. Sadly, SOLRs still down, but 5min pause should help, I think. Restarting SOLRs... So, worked a bit, but killed the SOLRs pretty quickly. After 07:30:00 the merge was complete (although many finished before then - using the host as the key is not very balanced). BUT after 15:00:00 still locked with only some successful submission. A coupled of reducers got over 69%.

Just realised all the reducers are running on the same nodes! That's not helping... Num reducers reduced to one?

To confirm this, I'd like to understand what happened for the earlier indexing processes. For AADDA, I assume the difference is largely down to the links. For LDWA, it seems unclear, as I recall Roger processing that in one go!

Ok, so confirmed with Roger that he did it in c.34 chunks of about about a thousand WARCs each (33,102 in total). Given there are 1.1 billion items in the index, this still seems to be rather good performance compared to what we see now. Perhaps the number of reducers per node was lower? Maybe only two, at 2GB each.

Rough history.

First, 1GB got both, and up to 8 of either mappers or reducers. Then, 2GB and 2/2 mappers/reducers. Currently, 1GB and 5/5 mappers/reducers.

Timing is right for the LDWA index to be during the 2R/node period, which may well explain why that worked ok.

Indexing directly onto HDFS¶

The second JISC2 index appears to have failed, so major change of tack required.

Indexing Old News¶

With Lewis, I chopped up some Cloudera code and managed to build multiple shards directly on HDFS during the Reduce phase.

Working through how to index locally during MapReduce. Using JISC2 as test data. Indexed 9, got 63 pages, including "lsidyv10a49/p10" Next, 100 items from the tail, and then the 9 from the top, to check the old ones are not overwritten/lost.

So, in 3mins, 100 issues pulled down and indexed. 622 distinct pages. Now attempting 10,000! 57 minutes! i.e. the whole thing in a day! One Mapper took three-times longer, which is a bit weird. Ran another 10,000, and 57 mins again! With one slow mapper! Perhaps some of the nodes are smaller and slower than the others. Running with 50,000, mapper is nice and fast. should be c. 5 hrs for linear speedup. So, reducers occasionally failed, with only 1GiB of RAM. Upped to 1.5GiB.

But eventually (10 hrs) they all ran leaving an index with 455,122 pages in it. Rerunning with increased RAM and empty indexes, to check it's all ok. 4hrs, 37mins, 49sec all good. Should be shorter, as most of the reducers took 30 mins, but one node is slow (openstack8) at 1.5hrs.

hdfs://openstack2.ad.bl.uk:8020/user/anjackson/newindex1

Using the new output format should be neater, but depends on Hadoop 2.x.x which will no doubt cause PAIN. I can use the hacked together logic easily enough.

2014-04-09 16:38:22,343 INFO org.apache.solr.core.CoreContainer: registering core: core1
2014-04-09 16:38:22,343 INFO org.apache.solr.core.SolrCore: QuerySenderListener sending requests to Searcher@5a90f357 main{StandardDirectoryReader(segments_q:2503 _ti(4.4):C49232 _mi(4.4):C2160 _mq(4.4):C127 _v7(4.4):C5538 _pi(4.4):C100 _xg(4.4):C5570 _qz(4.4):C86 _yk(4.4):C430 _ta(4.4):C90 _v1(4.4):C115 _vh(4.4):C2269 _vz(4.4):C87 _wx(4.4):C92 _wz(4.4):C15 _x6(4.4):C157 _xr(4.4):C596 _y0(4.4):C533 _ya(4.4):C553 _xq(4.4):C1960 _xs(4.4):C569 _xt(4.4):C228 _xu(4.4):C135 _yd(4.4):C24 _ye(4.4):C8 _yj(4.4):C56 _yl(4.4):C55 _ym(4.4):C65 _yn(4.4):C56 _yo(4.4):C48 _yp(4.4):C67 _yq(4.4):C21)}
2014-04-09 16:38:23,447 INFO org.apache.solr.update.LoggingInfoStream: [IFD][main]: init: current segments file is "segments_q"; deletionPolicy=org.apache.solr.core.IndexDeletionPolicyWrapper@75bb31b9
2014-04-09 16:38:23,500 INFO org.apache.solr.update.LoggingInfoStream: [IFD][main]: init: load commit "segments_n"
2014-04-09 16:38:24,995 INFO org.apache.solr.update.LoggingInfoStream: [IFD][main]: init: load commit "segments_q"
2014-04-09 16:38:25,145 INFO org.apache.solr.core.SolrCore: [core1] webapp=null path=null params={event=firstSearcher&q=static+firstSearcher+warming+in+solrconfig.xml&distrib=false} hits=69904 status=0 QTime=2801 
2014-04-09 16:38:25,163 INFO org.apache.solr.core.SolrCore: SolrDeletionPolicy.onInit: commits: num=2
    commit{dir=NRTCachingDirectory(org.apache.solr.store.hdfs.HdfsDirectory@772ce69f lockFactory=org.apache.solr.store.hdfs.HdfsLockFactory@7348fb70; maxCacheMB=192.0 maxMergeSizeMB=16.0),segFN=segments_n,generation=23}
    commit{dir=NRTCachingDirectory(org.apache.solr.store.hdfs.HdfsDirectory@772ce69f lockFactory=org.apache.solr.store.hdfs.HdfsLockFactory@7348fb70; maxCacheMB=192.0 maxMergeSizeMB=16.0),segFN=segments_q,generation=26}
2014-04-09 16:38:25,166 INFO org.apache.solr.core.SolrCore: newest commit generation = 26

Ok, so old commits (for syncing) are being dropped, which is fine. All results should be there.

Okay, took (19GiB!) impalad off openstack8 (which appears to be running a desktop) and it runs quicker.

Futzing with Solr JISC2 Newspapers.

shard3: 273,160 docs

Okay, made a new core, a replica of shard3, and swapped over folders. https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201312.mbox/%3Ca793e6444ceb454694f78e027e2fcb3f@BLUPR06MB417.namprd06.prod.outlook.com%3E Filled out all fields including HDFS URLs copied and modified from source node. Then swapped the folders. Then 1,094,330 documents still.

Shard0-3 actually in wrong places, but it still worked!

shard1 refers to /solr/jisc2/core_node3 (has shard3) shard2 refers to /solr/jisc2/core_node2 (has shard2) shard3 was in node4, now /solr/jisc2/core_node5 (has shard4) shard4 refers to /solr/jisc2/core_node1 (has shard1)

So: core_node1:shard1 needs to move to shard1:core_node3 core_node2:shard2 is fine core_node3:shard3 needs to move to shard3:core_node5 core_node5:shard4 needs to move to shard4:code_node1

Doesn't seem to matter until you add documents!

http://openstack9.ad.bl.uk:8983/solr/#/jisc2_shard1_replica1/query

Note these are set up much as per: https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS

Notes sent to Lewis¶

Once I’d worked out how to index onto HDFS and avoid openstack8 getting bogged down (I think it’s running a desktop session too?), I managed to get the indexing time down to 3 hours per 50,000 newspapers – i.e. the total run of 192,349 newspapers took about 12 hours to index. In fact, I think this is a significant overestimate as openstack8 was still running somewhat slower than the others, and because there seems to be duplicate rows in the table (see below) implying that around a third of the data was indexed twice.

I used four mappers to pull down the content, and then distributed the results to four reducers that each build a single Solr core. The four simultaneous clients downloading the OCR XML file did not cause DLS any issues, and grabbing them over HTTP did not seem to be a significant bottleneck at this scale. Each XML file was downloaded, and a distinct Solr record was created for each page (this is somewhat arbitrary – it could be by article or issue instead). The final index contains 1,094,330 distinct pages, but note that pages with no text were discarded (which was probably a mistake in retrospect, as knowing how many pages have no text might be useful/interesting).

You can see the results via: http://openstack9.ad.bl.uk:8983/solr/#/jisc2_shard1_replica1/query

And construct queries like this:

To see the pages that contain the term “British Museum”, sorted by date: http://openstack9.ad.bl.uk:8983/solr/jisc2_shard1_replica1/select?q=%22British%20Museum%22&sort=year_s+asc&rows=20&fl=originalname_s%2C+page_i&wt=xml&indent=true&hl=true&hl.fl=content&hl.fragsize=200&hl.simple.pre=*&hl.simple.post=*
To see the distribution of all pages across the years: http://openstack9.ad.bl.uk:8983/solr/jisc2_shard1_replica1/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=year_s or as XML http://openstack7.ad.bl.uk:8983/solr/jisc2_shard2_replica1/select?q=*%3A*&rows=0&wt=xml&indent=true&facet=true&facet.field=year_s
Mentions of “A study in scarlet”, sorted by time, showing fragments for context: http://openstack9.ad.bl.uk:8983/solr/jisc2_shard1_replica1/select?q=%22A+study+in+scarlet%22&sort=year_s+asc&rows=20&fl=originalname_s%2C+page_i&wt=xml&indent=true&hl=true&hl.fl=content&hl.fragsize=200&hl.simple.pre=*&hl.simple.post=*
Mentions of “Sherlock holmes”, as above, but as a short-range proximity search (i.e. up to one word apart): http://openstack9.ad.bl.uk:8983/solr/jisc2_shard1_replica1/select?q=%22sherlock%20holmes%22~1&sort=year_s+asc&rows=20&fl=originalname_s%2C+page_i&wt=xml&indent=true&hl=true&hl.fl=content&hl.fragsize=200&hl.simple.pre=*&hl.simple.post=*

Solr also provides an interface for summary statistics, here: http://openstack9.ad.bl.uk:8983/solr/#/jisc2_shard1_replica1/schema-browser e.g. you can select the field ‘simpletitle_s’ and the hit ‘Load Term Info’ to see the distribution of titles (we have 39, and the largest chunk of content is from the Morning Post). Similarly, you can select ‘originalname_s’ and see there are 146,110 distinct XML filenames, in contrast to the 192,349 lines from the HIVE table, which would appear to imply there are duplicate lines in the database.

I’ve not moved to use openstack6 instead of 8 yet, as I’m not sure how to do this cleanly without risking breaking what we have.

Back to WARC¶

Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FileAlreadyExistsException
    at org.apache.solr.store.hdfs.HdfsLockFactory.makeLock(HdfsLockFactory.java:53)
    at org.apache.lucene.store.BaseDirectory.makeLock(BaseDirectory.java:41)
    at org.apache.solr.store.blockcache.BlockDirectory.makeLock(BlockDirectory.java:283)
	at org.apache.lucene.store.NRTCachingDirectory.makeLock(NRTCachingDirectory.java:109)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:705)
    at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:77)
    at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:64)
    at org.apache.solr.core.SolrCore.initIndex(SolrCore.java:511)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:761)

I then went to add this to webarchive-discovery, but it turned out that Solr was compiled against Hadoop 2.x.x and so to get the same logic to work I had to backport the HDFS part of solr-core::4.7.1 to Hadoop 0.20.2 (shading the modified classes over the top of the original one). Due to semantic subtleties about fsync(), hflush() and plain old sync(), and FileSystem.get() not being thread-safe unless a specific property is set to avoid caching it, the port took a while.

However, now running on the shuf.1000 and it seems to work fine. Not actually terribly quick, though, as running all three shard reducers on one grunt! (grunt13). Need to re-config so there's e.g. only two reducers per grunt and bump up to the 26 shards (i.e. full production settings).

Yes, the 1000 took 04:xx:xx to complete indexing. Need to look at the way Cloudera does it, as they build LOTS of little shards and then merge them.

See e.g. https://github.com/apache/lucene-solr/blob/trunk/solr/contrib/map-reduce/src/java/org/apache/solr/hadoop/TreeMergeOutputFormat.java#L139

Note also that the original SolrOutputFormat code might well work now I've back-ported the HDFS classes to run on 0.20.2.

Initial attempt at setting up with Solr after indexing failed due to index version - 4.7.1 of Solr required.

At the same time, running the same 1000 source files through the jisc3 collection i.e. using 24 shards and reducers. Should be significantly faster, roughly 6 times faster, i.e. should be about c.45 minutes for the reduce phase instead of 3-4 hours. Yes, nice: 1hrs, 19mins, 49sec.

Yes, it works. 6,158,995 resources in three shards. There was a curious error (see also http://tickets.wa.bl.uk/trac/ticket/2338#comment:4) but only on one shard:

    Apr 14, 2014 5:09:25 PM org.apache.catalina.startup.Catalina start
    INFO: Server startup in 319574 ms 3132 2014-04-14 17:09:25.504; [recoveryExecutor-6-thread-1] WARN  org.apache.solr.update.UpdateLog  â Starting log replay tlog{file=/opt/data/solrnode3/ldwadev/data/tlog/tlog.0000000000000000254 refcount=2} active=false starting pos=05504 2014-04-14 17:09:27.876; [recoveryExecutor-6-thread-1] ERROR org.apache.solr.update.UpdateLog  â java.io.EOFException
        at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:154)
        at org.apache.solr.common.util.JavaBinCodec.readStr(JavaBinCodec.java:559)
        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:180)
        at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:477)
        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
        at org.apache.solr.common.util.JavaBinCodec.readSolrInputDocument(JavaBinCodec.java:393)
        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:229)
        at org.apache.solr.common.util.JavaBinCodec.readArray(JavaBinCodec.java:477)
        at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
        at org.apache.solr.update.TransactionLog$LogReader.next(TransactionLog.java:630)
        at org.apache.solr.update.UpdateLog$LogReplayer.doReplay(UpdateLog.java:1272)
        at org.apache.solr.update.UpdateLog$LogReplayer.run(UpdateLog.java:1215)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

Interesting. Fortunately it doesn't seem to have significantly damaged the data (the transaction log is only really a 'back up' to be used in case the index itself is not shut down correctly), as we have 6.1 million URLs which is roughly what I'd expect from 1/40th of 2.5 billion items.

As a precaution, I've modified the code so that the reducer waits while the data is committed to disk. I'd assumed that was the default, but actually the embedded Solr server we use here runs most tasks on background threads. The situation we've seen here is consistent with the background thread having been forcefully killed, so adding a blocking 'wait' should resolve it, hopefully.

Fixed up the tests again, enabled solr.lock.type configuration, Roger helped resolve a classpath issue (hadoop-core should be 'provided', c.f. http://answers.mapr.com/questions/4811/numberformatexception-setting-up-job). And now running on a random 20,000 input files, which we can expect to take 20-25 hours.

So, running 5 things on grunt22, some are hanging. One was very unhappy. It had this message earlier on:

Exception in thread "Lucene Merge Thread #17" org.apache.lucene.index.MergePolicy$MergeException: java.lang.NullPointerException
    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
Caused by: java.lang.NullPointerException
    at org.apache.lucene.util.packed.MonotonicAppendingLongBuffer.get(MonotonicAppendingLongBuffer.java:75)
    at org.apache.lucene.util.packed.AbstractAppendingLongBuffer.get(AbstractAppendingLongBuffer.java:101)
    at org.apache.lucene.index.MultiDocValues$OrdinalMap.getGlobalOrd(MultiDocValues.java:390)
    at org.apache.lucene.codecs.DocValuesConsumer$7$1.setNext(DocValuesConsumer.java:610)
    at org.apache.lucene.codecs.DocValuesConsumer$7$1.hasNext(DocValuesConsumer.java:558)
    at org.apache.lucene.codecs.lucene45.Lucene45DocValuesConsumer.addNumericField(Lucene45DocValuesConsumer.java:141)
    at org.apache.lucene.codecs.lucene45.Lucene45DocValuesConsumer.addSortedSetField(Lucene45DocValuesConsumer.java:414)
    at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.addSortedSetField(PerFieldDocValuesFormat.java:121)
    at org.apache.lucene.codecs.DocValuesConsumer.mergeSortedSetField(DocValuesConsumer.java:441)
    at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:207)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:116)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4146)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3743)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
Exception in thread "Lucene Merge Thread #18" org.apache.lucene.index.MergePolicy$MergeException: java.lang.NullPointerException
    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
Caused by: java.lang.NullPointerException
Exception in thread "Lucene Merge Thread #19" org.apache.lucene.index.MergePolicy$MergeException: java.lang.NullPointerException
    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
Caused by: java.lang.NullPointerException

and then somewhat later, lots of

2014-04-16 09:46:17 INFO  UpdateHandler:540 - start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2014-04-16 09:46:17 ERROR CommitTracker:120 - auto commit error...:org.apache.lucene.index.CorruptIndexException: codec header mismatch: actual header=1701604449 vs expected header=1071082519 (resource: _mq_Lucene41_0.tip)
    at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:128)
    at org.apache.lucene.util.fst.FST.<init>(FST.java:318)
    at org.apache.lucene.util.fst.FST.<init>(FST.java:304)
    at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader.<init>(BlockTreeTermsReader.java:484)
    at org.apache.lucene.codecs.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:176)
    at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:437)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:195)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:244)
    at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:116)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:96)
    at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:141)
    at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:279)
    at org.apache.lucene.index.IndexWriter.applyAllDeletesAndUpdates(IndexWriter.java:3191)
    at org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:3182)
    at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2901)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3049)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3016)
    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:578)
    at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

Which is somewhat unnerving, but given the lack of fsync() and hflush() is perhaps not entirely unexpected when the system is under this kind of overload.

I used hadoop job -kill-task to kill the locked attempt and this failing one, to see if it restarts ok. However, the right solution is to reduce the load and only allow two reducers and, say, two or three mappers.

Okay, so killing and re-starting with 3M:2R:2GB rather than 5M:5R:1GB. The mappers took a little over four hours last time, so we'll see if they're quicker or slower when we drop the load on each machine.

Timing seem to be much the same - map phase looks like coming in at about 4 hours. Maybe slightly slower, but not significant, 100% mappers done at 4:30:00.

Sorting now (about 30mins work...) and reducing away. Should be done by around 15 hours as that was when the faster reducers (newer, fatter nodes as it happens) started to complete.

for 10 on grunt36, problem started like this:

2014-04-16 19:26:17 ERROR CommitTracker:120 - auto commit error...:org.apache.solr.common.SolrException: Error opening new searcher
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1521)
    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:614)
    at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.lucene.index.CorruptIndexException: codec header mismatch: actual header=65537 vs expected header=1071082519 (resource: _3x_Lucene41_0.tip)
    at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:128)
    at org.apache.lucene.util.fst.FST.<init>(FST.java:318)
    at org.apache.lucene.util.fst.FST.<init>(FST.java:304)
    at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader.<init>(BlockTreeTermsReader.java:484)
    at org.apache.lucene.codecs.BlockTreeTermsReader.<init>(BlockTreeTermsReader.java:176)
    at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:437)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:195)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:244)
    at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:116)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:96)
    at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:141)
    at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:235)
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:100)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:382)
    at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:288)
    at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:273)
    at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:250)
    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1445)
    ... 10 more

Later followed by repeating...

2014-04-17 08:30:46 ERROR WARCIndexerReducer:168 - WARCIndexerReducer.reduce() - sleeping for 1 minute: pos=769
org.apache.solr.common.SolrException: pos=769
    at org.apache.solr.update.HdfsTransactionLog.lookup(HdfsTransactionLog.java:267)
    at org.apache.solr.update.UpdateLog.lookup(UpdateLog.java:711)
    at org.apache.solr.handler.component.RealTimeGetComponent.getInputDocumentFromTlog(RealTimeGetComponent.java:217)
    at org.apache.solr.handler.component.RealTimeGetComponent.getInputDocument(RealTimeGetComponent.java:242)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.getUpdatedDocument(DistributedUpdateProcessor.java:892)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:791)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:557)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
    at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:150)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:118)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
    at uk.bl.wa.hadoop.indexer.WARCIndexerReducer.checkSubmission(WARCIndexerReducer.java:162)
    at uk.bl.wa.hadoop.indexer.WARCIndexerReducer.reduce(WARCIndexerReducer.java:124)
    at uk.bl.wa.hadoop.indexer.WARCIndexerReducer.reduce(WARCIndexerReducer.java:34)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:469)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.io.IOException: Could not obtain block: blk_-2269227342052188925_10888306 file=/user/anjackson/jisc2_shards/shard11/data/tlog/tlog.0000000000000000187
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1993)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchBlockByteRange(DFSClient.java:2028)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2116)
    at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:46)
    at org.apache.solr.update.FSDataFastInputStream.readWrappedStream(HdfsTransactionLog.java:545)
    at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:114)
    at org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:80)
    at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
    at org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:192)
    at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:172)
    at org.apache.solr.update.HdfsTransactionLog.lookup(HdfsTransactionLog.java:262)
    ... 27 more

So, perhaps autocommit is causing issues?

and on 1 other (16 on grunt44) and elsewhere I think, a probably data quality problem with the text.

2014-04-17 09:42:39 ERROR WARCIndexerReducer:168 - WARCIndexerReducer.reduce() - sleeping for 1 minute: org.apache.solr.client.solrj.SolrServerException: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #2958519, byte #3375884) org.apache.solr.client.solrj.SolrServerException: org.apache.solr.client.solrj.SolrServerException: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #2958519, byte #3375884) at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:223) at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:118) at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68) at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54) at uk.bl.wa.hadoop.indexer.WARCIndexerReducer.checkSubmission(WARCIndexerReducer.java:162) at uk.bl.wa.hadoop.indexer.WARCIndexerReducer.reduce(WARCIndexerReducer.java:124) at uk.bl.wa.hadoop.indexer.WARCIndexerReducer.reduce(WARCIndexerReducer.java:34) at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:469) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127) at org.apache.hadoop.mapred.Child.main(Child.java:264) Caused by: org.apache.solr.client.solrj.SolrServerException: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #2958519, byte #3375884) at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:155) ... 13 more Caused by: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #2958519, byte #3375884) at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18) at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731) at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657) at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809) at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:397) at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246) at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174) at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916) at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:150) ... 13 more Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xfffe at char #2958519, byte #3375884) at com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335) at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:249) at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101) at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84) at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57) at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992) at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628) at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126) at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701) at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)

So, tweaked config, disabling some caching/NRT code (as is done in Cloudera's mapper) and putting more time between autocommits. Also, this bad data (above) is blocking completion, so recoding to drop data that fails repeatedly, and also will attempt to cleanup data earlier using a UTF-8 encode/decode cycle.

So, two reducers now suffer this error, it seems: 0 and 16. At 20 hours, 14/24 had completed, so that's looking ok, if somewhat variable across the cluster. Sizes are:

[anjackson@explorer ~]$ hadoop fs -du jisc2_shards
Found 24 items
11819458999  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard1
24886167105  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard10
10995867725  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard11
24675012054  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard12
20682623820  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard13
20570860041  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard14
25995039001  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard15
29325605594  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard16
8918695012   hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard17
29250320596  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard18
24395859879  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard19
24550889972  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard2
24869064314  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard20
22475985139  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard21
23519811316  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard22
23880387734  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard23
25369034114  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard24
26383915157  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard3
24074510289  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard4
28070109687  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard5
29602983297  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard6
20425924336  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard7
24060581760  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard8
24771455202  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard9

Mapping is +1, so R10 was the one that crashed, and that's shard 11. Shards 1 and 17 are R0 and R16 which have the bad-data problem. The others are still very variable, from c. 20-30GiB per shard, although some 8 others are still writing data.

Okay, wiping out, recompiling, and restarting. Running with a small test found a minor bug in the new reporting hooks, but now nicely reports numbers of records processed etc. Seems to be pushing 30,000,000 records per hour, c. 8,000 records per second. Will check proper stats later when it's run for a while.

So reducer phase kicked in. Some errors, particularly earlier on, from DFS not being able to provide blocks in time, I think. lost c.2,500 records as a consequence. Having the counters reporting this is a great improvement.

Hm, after 11hrs, 46mins, 47sec, the Reducers have got through 10,306,036 records out of 122,627,207 and are doing quite a lot of retries. This seems somewhat slower than before. Sizes are also much lower:

2130092437  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard1
2585581906  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard10
2574757668  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard11
2198718069  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard12
2419696449  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard13
2560682336  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard14
2121469973  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard15
2120275520  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard16
2601464617  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard17
2177693487  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard18
2370112142  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard19
2173176068  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard2
2367320297  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard20
2291537786  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard21
2217858387  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard22
2196315505  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard23
2176750414  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard24
2407447171  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard3
2407425960  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard4
2171001576  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard5
2198529851  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard6
2232396290  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard7
2218614633  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard8
2365945226  hdfs://nellie-private:54310/user/anjackson/jisc2_shards/shard9

Ok, trying re-enabling block caching and re-running... Ah, that's a bit better. After 6hrs, 53mins, 39sec the reducer has got through 42,146,677 records. There have been 98,300 dropped records, same EOF stuff it seems.

So, after 21:41:18 worked pretty much fine. Mappers saw:

122,627,207 records
22,659,814 null records (i.e. redirects, 404s etc).
3 failed mappers (two due to 'Java heap space' and one mysterious IOException), but all worked on re-submission.
Finished after 04:24:10.

Reducers saw:

122,537,935 submitted ok.
611 errors when submitting, leading to 89,272 dropped records.
Two reduce tasks failed but worked the second time around.
Finished after 21:22:32 (total time, including mappers, so dedicated reducer time is c. 17 hours)
Spread of reducer finish times is rather large - some done in about 10 hours, others taking much longer. Perhaps some nodes are faster than others?

So, currently copying this out, to see how long that takes. It's over half a TB of data, so that's not trivial and we may have to actually run Solr from HDFS if it doesn't transfer reasonably quickly.

Next decision is whether to re-instate any image analysis. Will bump up memory usage and slow things down somewhat, although it's not clear by how much. Another option would be to randomly sample only a subset of images for analysis, e.g. 1 in 100, so we can gauge it's usefulness without crippling the indexer too badly.

(as an aside, it's worth noting that extractApachePreflightErrors is currently enabled and not causing any problems, but image analysis is more computationally expensive and there are more images that PDFs).

Another issue is whether to enabled storage of the term vectors or use the More Like This handler to extract high-ranking (TF-IDF) terms for a given facet (a la in a word.

Hm, only 1/20th of the data and it still takes a whole day of runtime. 20 days is pushing it, and not that much of an improvement over where we were before. I'm really going to have to push the shard count up to ~70 to get a decent speed-up.

Transfering the 24 shard version (1/20th of the collection) off the cluster took about 6 hours. Not too bad, but still a few days to get the data off.

Set up a 48 shard run with 45,000 inputs (1/10th). Ran mappers on 275,912,357 records (51,087,107 nulls) in 16 hours but... problems with DOTUK-HISTORICAL-1996-2010-GROUP-AO-XABBKD-20110428000000-00001.arc.gz

Error: GC overhead limit exceeded
Error: Java heap space
2014-04-25 02:41:45 FATAL Child:318 - Error running child : java.lang.OutOfMemoryError: Java heap space
    at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232)
    at java.lang.StringCoding.encode(StringCoding.java:272)
    at java.lang.String.getBytes(String.java:946)
    at uk.bl.wa.solr.TikaExtractor.extract(TikaExtractor.java:241)
    at uk.bl.wa.analyser.payload.WARCPayloadAnalysers.analyse(WARCPayloadAnalysers.java:107)
    at uk.bl.wa.indexer.WARCIndexer.extract(WARCIndexer.java:449)
    at uk.bl.wa.indexer.WARCIndexer.extract(WARCIndexer.java:220)
    at uk.bl.wa.hadoop.indexer.WARCIndexerMapper.map(WARCIndexerMapper.java:91)
    at uk.bl.wa.hadoop.indexer.WARCIndexerMapper.map(WARCIndexerMapper.java:28)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
Task attempt_201404161414_0096_m_020825_2 failed to report status for 20000 seconds. Killing!
Error: GC overhead limit exceeded
2014-04-25 02:53:57 FATAL Child:318 - Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.tika.language.ProfilingWriter.addLetter(ProfilingWriter.java:82)
    at org.apache.tika.language.ProfilingWriter.addSeparator(ProfilingWriter.java:87)
    at org.apache.tika.language.ProfilingWriter.write(ProfilingWriter.java:72)
    at org.apache.tika.language.LanguageProfile.<init>(LanguageProfile.java:67)
    at org.apache.tika.language.LanguageProfile.<init>(LanguageProfile.java:71)
    at org.apache.tika.language.LanguageIdentifier.<init>(LanguageIdentifier.java:133)
    at uk.bl.wa.extract.LanguageDetector.detectLanguage(LanguageDetector.java:102)
    at uk.bl.wa.analyser.text.LanguageAnalyser.analyse(LanguageAnalyser.java:54)
    at uk.bl.wa.analyser.text.TextAnalysers.analyse(TextAnalysers.java:74)
    at uk.bl.wa.indexer.WARCIndexer.extract(WARCIndexer.java:461)
    at uk.bl.wa.indexer.WARCIndexer.extract(WARCIndexer.java:220)
    at uk.bl.wa.hadoop.indexer.WARCIndexerMapper.map(WARCIndexerMapper.java:91)
    at uk.bl.wa.hadoop.indexer.WARCIndexerMapper.map(WARCIndexerMapper.java:28)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)

OKAY, so modified to catch OOME and move on instead of failing totally. Re-running. Got past the mappers ok. 9 hours of sorting.

At 26hrs, 38mins, 50sec, we are at 34,571,153 of 275,918,077 records in the reducers.

In the end, Total Time: 233hrs, 51mins, 33sec. Of 275,917,465 records, 72 errors caused 612 records tobe dropped. These errors almost exclusively occurred during the first few hours of the index build.

Okay, upped to 1000 instead of 100 per submission, (was 500 before, when it seemed to work ok), and launching on the second chunk.

Total: 41hrs, 34mins, 47sec Currently 29,562,045 of 275,034,327 ingested. Started at Tue May 06 15:35:33 BST 2014 Sort fninshed at 7-May-2014 03:12:13 i.e. c. 30 hours for one tenth of this tenth. Definately appears much slower than when processing smaller chunks. KILLING and reverting to smaller chunks - 22500 per job.

So, now: 7hrs, 51mins, 21sec, at 2,517,002 of 137,026,829 records.

GAH.

Going back to the task tracker, 127,000,000 records indexed in 21 hours!

From http://194.66.232.87:50030/jobdetailshistory.jsp?logFile=file%3A%2Fusr%2Flib%2Fhadoop-0.20%2Flogs%2Fhistory%2Fdone%2Fbellie-private_1397654045202_job_201404161414_0020_anjackson_..%252Fia.files.shuf.split.aa.nofails_1397773215102

Submitted At: 17-Apr-2014 23:20:57 Launched At: 17-Apr-2014 23:21:01 (3sec) Finished At: 18-Apr-2014 21:02:19 (21hrs, 41mins, 18sec)

So, git diffing against that time. Minor code changes, some config changes, but many Solr changes.

git diff 'HEAD@{17-Apr-2014 23:20:00}' HEAD .

Changes are since commit e3189e0.

Also, is Cloud config ok? Which config set does the implementation depend on? It caches the config from the server and puts it in the distributed cache.

OK, so new solrconfig.xml has this line:

<!-- Enable/Disable the hdfs cache. -->    
<str name="solr.hdfs.blockcache.enabled">${solr.hdfs.blockcache.enabled:true}</str>

which means this old line (which used to do nothing) from Solate.createEmbeddedSolrServer now kicks in:

System.setProperty("solr.hdfs.blockcache.enabled", "false");

overriding the default behaviour I changed before, in the solr-shade config where solr.hdfs.blockcache.enabled is checked.

This means it is likely that caching is off, and so we are in the same 'slow' case we saw before. There is a slight discrepency as I remember a warning about write caching in the reducer logs, so I should check that. Would also make sense to log whether caching is on or not, so we can tell if this is the case more easily.

Hm, looks like that error would be there whether or not the block cache was enabled overall:

boolean blockCacheEnabled = params.getBool(BLOCKCACHE_ENABLED, true);
boolean blockCacheReadEnabled = params.getBool(BLOCKCACHE_READ_ENABLED, true);
boolean blockCacheWriteEnabled = params.getBool(BLOCKCACHE_WRITE_ENABLED, true);

if (blockCacheWriteEnabled) {
  LOG.warn("Using " + BLOCKCACHE_WRITE_ENABLED + " is currently buggy and can result in readers seeing a corrupted view of the index.");
}

I believe this error is not a concern, in that we don't read it until after we've written. A good thing for Cloudera to check.

Note that solr.hdfs.nrtcachingdirectory.enable is set to false by default, and this seemed to work fine before, so I'll leave that disabled as we are not interested in the NRT case.

Refactored to enable it again, and improved the logging. Now running much better, I think. After 13hrs, 44mins, 2sec, 58,423,038 of 138,007,497 records are in. i.e. around half-way through. After 32hrs, 22mins, 55sec, finished, with 137,823,323 records successfully stored. Actually slower than 24 shards? Will try dropping to 1 reducer per node.

Ok, now with 1 reducer per node: Total time 44hrs, 9mins, 37sec for 137,599,824 records!

Is this getting slower with index size? Well, Gil is not here, so may as well push in the next chunk and get more timings.

[anjackson@explorer warc-hadoop-indexer]$ hadoop jar target/warc-hadoop-indexer-2.0.0-SNAPSHOT-job.jar uk.bl.wa.hadoop.indexer.WARCIndexerRunner -c ../configs/jisc.conf -i ../ia.files.shuf.split.22500.af -o shindex-jisc-split-22500-af

So, it seems to be slowing down, whichis presumably because adding to the index now requires lookups that are too large to cache effectively. NOTE that we have still not hit LDWA size in terms of number of documents (we're only at about 700,000), so separate solr servers handle this better (presumably because they can keep the ID lookup cached in RAM/MMAPped).

I can see two options:

Build separate indexes and merge them later.
- This should work, and is what Cloudera do, but items with the same ID cause duplicate entries, as opposed to adding an item with the same ID, which would be an update and would merge crawl_dates etc.
Re-build on a larger Solr cluster:
- Take the data so far, and go back to firing the docs at a separate SolrCloud.
- This worked okay for LDWA before, i.e. up to 1.1 billion docs.
- Plus, we now have lots of new servers to choose from.

So, I'm willing to accept some failure for totally deal with de-duplication this time, and so have launched a parallel job running on a separate 1/20th of the data (split 22500-ag) which will build a separate index that we'll need to merge in later. We'll see how that goes for speed.

However, I'll also ask Gil to set up a three-server Solr cloud for the 48 shard collection to go on to, and depending on the timing, switch over to updating that directly rather than building shards on the cluster.

Back from HDFS to SolrCloud¶

So, while away at IICP, Gil set up a three-node cluster of 48 shards that includes the data from HDFS (but not the final additional shards). This is called JISC3

Differences to LDWA are DocValues, NIOFS, 48 v 24 shards, plus codebase/config difference.

We are also setting up a clean JISC4 install on LDWA01, which uses DocValues/codebase but MMAP and 24 shards. This then allows performance to be re-evaluated, having sorted out the reducer-to-mapper slowdowns via compression, and dropped the number of reducers down to 1/node.

So, now we run JISC3 and JISC4 side-by-side. We will test indexing performance on both systems. For JISC4, I have also disabled host-level linkage, so only domain-to-domain links remain.

Gil is experimenting with start-up config and related issues for JISC3. Also looking at disk IO rates, as this appears to be a bottleneck.

Currently running the first JISC4 index into the LD01 server. Damn, old indexer logic closed the SolrServer connection, leading to a problem when the reducer key changed:

2014-05-30 15:12:00 INFO WARCIndexerReducer:73 - Configuring reducer, including Solr connection... 2014-05-30 15:12:00 INFO SolrWebServer:79 - Setting up CloudSolrServer client via zookeepers. 2014-05-30 15:12:00 INFO WARCIndexerReducer:89 - Initialisation complete. 2014-05-30 15:12:08 INFO WARCIndexerReducer:204 - Submitted 1000 docs [0] ... 2014-05-30 18:47:08 INFO WARCIndexerReducer:204 - Submitted 308 docs [0] 2014-05-30 18:48:48 ERROR WARCIndexerReducer:227 - Sleeping for 5 minute(s): null java.util.concurrent.RejectedExecutionException at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:1768)

Well damn, that looked pretty quick, but then died because I left the 'to HDFS' logic in that closes the shard after each key. Recompiling with that code fixed... And cleared out Solr (delete+commit) and restarting.

After 6hrs, 15mins, 7sec - 42,449,000 records processed. After 14hrs, 29mins, 46sec - 120,113,818 records processed. After 14hrs, 37mins, 8sec - ALL 120,585,548 RECORDS PROCESSED no drops.

2014-05-31 11:23:24 INFO WARCIndexerReducer:73 - Configuring reducer, including Solr connection... 2014-05-31 11:23:24 INFO SolrWebServer:79 - Setting up CloudSolrServer client via zookeepers. 2014-05-31 11:23:24 INFO WARCIndexerReducer:89 - Initialisation complete. 2014-05-31 11:23:28 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] 2014-05-31 11:23:30 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] 2014-05-31 11:23:31 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] 2014-05-31 11:23:33 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] 2014-05-31 11:23:34 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] 2014-05-31 11:23:35 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] ... 2014-05-31 14:02:42 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] 2014-05-31 14:02:51 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] (above for R3, at 346,3000 records) ... 2014-05-31 14:04:19 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] 2014-05-31 14:04:22 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] 2014-05-31 14:04:25 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] 2014-05-31 14:04:30 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] 2014-05-31 14:04:43 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] 2014-05-31 14:04:47 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] 2014-05-31 14:04:51 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] 2014-05-31 14:04:56 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] ... 2014-05-31 20:14:29 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] 2014-05-31 20:14:40 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] 2014-05-31 20:14:42 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] 2014-05-31 20:14:46 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] 2014-05-31 20:14:52 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] ... 2014-05-31 22:21:07 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] 2014-05-31 22:21:14 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] 2014-05-31 22:21:16 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] 2014-05-31 22:21:17 INFO WARCIndexerReducer:209 - Submitted 1000 docs [0] 2014-05-31 22:21:20 INFO WARCIndexerReducer:209 - Submitted 717 docs [0] END

The committed, which took a while, leading to 114,309,425 distinct results.

Running 2/20: 138,559,865 records in total. At 11hrs, 45mins, 19sec - 59,537,000 records processed. At 14hrs, 38mins, 2sec - 79,937,093 records processed. (looking to be c. 1/3rd slower) At 22hrs, 14mins, 7sec - 135,804,093 records processed. At 22hrs, 44mins, 34sec - All 138,559,865 records processed.

Avoided committing the data, in case that slows things down somehow. Could really do with testing that.

Running 3/20: 137,655,975 records in total. At 9hrs, 3mins, 14sec - 30,663,000 records processed. At 12hrs, 18mins, 34sec - 50,197,000 records processed. (looking to be c. 1/6th slower) At 15hrs, 28mins, 27sec - 66,149,000 records processed. At 27hrs, 56mins, 31sec - ALL 137,655,975 records processed. (+5 hrs)

Running 4/20: 138,970,852 records in total. At 9hrs, 47mins, 49sec - 29,392,000 records processed (somewhat slower again) At 12hrs, 24mins, 0sec - 42,417,000 records processed. At 23hrs, 7mins, 33sec - 70,937,574 records processed. At 34hrs, 2mins, 35sec - 119,169,574 records processed.

-at -as -ar -aq

Running 5/20 -ao: 138,035,746 records in total: At 5hrs, 56mins, 13sec - 9,733,001 records processed. At 6hrs, 43mins, 9sec - 17,338,001 At 7hrs, 4mins, 4sec - 20,532,001 At 7hrs, 32mins, 46sec - 24,908,001 At 16hrs, 15mins, 11sec - 83,220,667

Ok, interesting, shards running on .183 are REALLY SLOW (c. 2,000). (note '-DnumShards=48' on 183 shard http://192.168.1.183:8983/solr/#/~cloud - same true for .181/182 but not .203) .203 shards are all REALLY FAST (c. 200,000). .181 and .182 are in the middle (c. 80,000) half as fast as those on .203).

CRASH.

Cleanup.

Running 6/20 -ap: So, jisc3 still crashed. jisc1/2 and ld01 were fine. ld01 fastest, jisc1/2 only a little slower (around 12 hours totoal runtime).

BEFORE 4212 tomcat 20 0 110g 49g 44g S 21.8 39.2 35:20.70 0 93k /opt/java/bin/jav 4069 tomcat 20 0 114g 23g 21g S 30.8 18.8 35:29.51 0 100k /opt/java/bin/jav 3861 tomcat 20 0 116g 18g 15g S 7.6 14.4 36:13.40 0 98k /opt/java/bin/jav 3555 tomcat 20 0 111g 15g 12g S 46.3 12.0 47:27.31 0 108k /opt/java/bin/jav 3648 tomcat 20 0 114g 10g 8.6g S 22.8 8.3 44:54.40 0 89k /opt/java/bin/jav 3437 tomcat 20 0 110g 5.5g 3.1g S 33.7 4.3 31:48.07 0 72k /opt/java/bin/jav AFTER 4212 tomcat 20 0 125g 39g 33g S 127.6 31.1 316:05.59 1420 264k /opt/java/bin/ja 4069 tomcat 20 0 131g 25g 20g S 36.4 20.3 327:59.17 3072 291k /opt/java/bin/jav 3861 tomcat 20 0 131g 18g 14g S 162.6 14.7 319:25.06 72m 294k /opt/java/bin/ja 3555 tomcat 20 0 126g 14g 10g S 133.2 11.9 351:19.75 171m 280k /opt/java/bin/ja 3648 tomcat 20 0 125g 12g 7.3g S 48.3 10.0 256:06.10 160m 275k /opt/java/bin/jav 3437 tomcat 20 0 117g 9.7g 4.4g S 120.0 7.7 615:41.91 121m 244k /opt/java/bin/ja

top - 19:15:15 up 1 day, 7:27, 3 users, load average: 15.50, 21.18, 22.15 Tasks: 514 total, 3 running, 511 sleeping, 0 stopped, 0 zombie Cpu(s): 0.1%us, 25.9%sy, 0.0%ni, 73.9%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 132123444k total, 131742060k used, 381384k free, 103544k buffers Swap: 16777208k total, 540088k used, 16237120k free, 107471016k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP nFLT COMMAND 7318 tomcat 20 0 126g 25g 22g S 33.9 20.1 363:11.52 6140 10k /opt/java/bin/java -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode4/conf/logging.properties 7535 tomcat 20 0 125g 24g 21g S 52.7 19.1 440:47.60 6364 9936 /opt/java/bin/java -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode2/conf/logging.properties 7849 tomcat 20 0 127g 21g 18g S 35.1 16.9 441:48.38 10m 9575 /opt/java/bin/java -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode5/conf/logging.properties 7938 tomcat 20 0 124g 19g 17g S 73.6 15.9 446:34.19 4104 11k /opt/java/bin/java -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode6/conf/logging.properties 3648 tomcat 20 0 140g 12g 7.7g S 127.5 10.2 1395:46 239m 433k /opt/java/bin/java -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode3/conf/logging.propertie 3437 tomcat 20 0 131g 9.4g 4.3g S 26.4 7.5 1483:50 249m 376k /opt/java/bin/java -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode1/conf/logging.properties 2508 root 2 -18 102m 14m 3612 S 0.0 0.0 0:02.87 0 10 /sbin/dmeventd 7107 root 20 0 98284 4008 3036 S 0.0 0.0 4:13.88 0 1 sshd: root@pts/2 2934 haldaemo 20 0 38704 3472 2236 S 17.5 0.0 179:02.43 456 25 hald 9259 postfix 20 0 81352 3376 2496 S 0.0 0.0 0:00.00 0 44 pickup -l -t fifo -u 3398 root 20 0 98284 3208 3016 S 0.0 0.0 20:40.31 776 1 sshd: root@pts/1 3380 root 20 0 98284 3184 3016 S 0.0 0.0 33:17.52 804 0 sshd: root@pts/0 3287 root 20 0 81272 2720 2516 S 0.0 0.0 129:46.09 700 0 /usr/libexec/postfix/master 9302 root 20 0 6204 2712 508 R 0.0 0.0 0:00.01 0 0 pidof -c -o 9300 -o 9295 -o %PPID -x rhsmcertd

tomcat 3465 62.9 8.8 133563512 11664320 ? Sl 20:48 2:54 /opt/java/bin/java -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode1/conf/logging.properties -Xx:+UseG1GC -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Xms5120m -Xmx5120m -Dsolr.solr.home=/opt/solrnode1 -Duser.language=en -Duser.country=uk -Dbootstrap_confdir=/opt/solrnode1/jisc4/conf -Dcollection.configName=jisc4cfg -DnumShards=24 -Dsolr.data.dir=/opt/data/solrnode1/jisc4/data -DzkHost=zk01-dev-private.solr.wa.bl.uk:9983,zk02-dev-private.solr.wa.bl.uk:9983,zk03-dev-private.solr.wa.bl.uk:9983 -Djava.endorsed.dirs=/opt/tomcat/endorsed -classpath /opt/tomcat/bin/bootstrap.jar:/opt/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/opt/tomcat_instances/solrnode1 -Dcatalina.home=/opt/tomcat -Djava.io.tmpdir=/opt/tomcat_instances/solrnode1/tmp org.apache.catalina.startup.Bootstrap start tomcat 3554 52.2 6.2 131334484 8299072 ? Sl 20:49 2:07 /opt/java/bin/java -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode2/conf/logging.properties -Xx:+UseG1GC -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Xms5120m -Xmx5120m -Dsolr.solr.home=/opt/solrnode2 -Duser.language=en -Duser.country=uk -Dbootstrap_confdir=/opt/solrnode2/jisc4/conf -Dcollection.configName=jisc4cfg -DnumShards=24 -Dsolr.data.dir=/opt/data/solrnode2/jisc4/data -DzkHost=zk01-dev-private.solr.wa.bl.uk:9983,zk02-dev-private.solr.wa.bl.uk:9983,zk03-dev-private.solr.wa.bl.uk:9983 -Djava.endorsed.dirs=/opt/tomcat/endorsed -classpath /opt/tomcat/bin/bootstrap.jar:/opt/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/opt/tomcat_instances/solrnode2 -Dcatalina.home=/opt/tomcat -Djava.io.tmpdir=/opt/tomcat_instances/solrnode2/tmp org.apache.catalina.startup.Bootstrap start tomcat 3638 54.1 5.9 138602448 7892744 ? Sl 20:49 2:11 /opt/java/bin/java -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode3/conf/logging.properties -Xx:+UseG1GC -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Xms5120m -Xmx5120m -Dsolr.solr.home=/opt/solrnode3 -Duser.language=en -Duser.country=uk -Dbootstrap_confdir=/opt/solrnode3/jisc4/conf -Dcollection.configName=jisc4cfg -DnumShards=24 -Dsolr.data.dir=/opt/data/solrnode3/jisc4/data -DzkHost=zk01-dev-private.solr.wa.bl.uk:9983,zk02-dev-private.solr.wa.bl.uk:9983,zk03-dev-private.solr.wa.bl.uk:9983 -Djava.endorsed.dirs=/opt/tomcat/endorsed -classpath /opt/tomcat/bin/bootstrap.jar:/opt/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/opt/tomcat_instances/solrnode3 -Dcatalina.home=/opt/tomcat -Djava.io.tmpdir=/opt/tomcat_instances/solrnode3/tmp org.apache.catalina.startup.Bootstrap start [r

...

switched over, stopped two and started three, so four running. Starts to hang.

top - 23:00:37 up 3:06, 1 user, load average: 15.97, 9.06, 6.49 Tasks: 509 total, 1 running, 508 sleeping, 0 stopped, 0 zombie Cpu(s): 0.2%us, 38.5%sy, 0.0%ni, 60.6%id, 0.6%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 132123444k total, 131549872k used, 573572k free, 20292k buffers Swap: 16777208k total, 2156k used, 16775052k free, 115377100k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP nFLT COMMAND 3554 tomcat 20 0 142g 43g 38g S 403.6 34.5 96:07.83 620 81k /opt/java/bin/java -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode2/conf/logging.propertie 4383 tomcat 20 0 124g 26g 24g S 29.1 21.2 13:14.42 0 6334 /opt/java/bin/java -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode6/conf/logging.properties 4299 tomcat 20 0 127g 26g 23g S 29.1 21.1 12:56.23 0 6050 /opt/java/bin/java -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode5/conf/logging.properties 4215 tomcat 20 0 126g 24g 21g S 19.4 19.3 12:12.05 0 6030 /opt/java/bin/java -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode4/conf/logging.properties 2

top - 23:04:25 up 3:10, 1 user, load average: 21.82, 15.22, 9.49 Tasks: 509 total, 1 running, 507 sleeping, 0 stopped, 1 zombie Cpu(s): 0.0%us, 29.0%sy, 0.0%ni, 70.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 132123444k total, 131561344k used, 562100k free, 20408k buffers Swap: 16777208k total, 2156k used, 16775052k free, 115387020k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP nFLT COMMAND 3554 tomcat 20 0 142g 43g 38g S 237.0 34.5 96:26.74 620 81k /opt/java/bin/java -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode2/conf/logging.propertie 4383 tomcat 20 0 124g 26g 24g S 2746.6 21.2 16:53.53 0 6334 /opt/java/bin/java -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode6/conf/logging.properti 4299 tomcat 20 0 127g 26g 23g S 2833.7 21.1 16:42.29 0 6050 /opt/java/bin/java -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode5/conf/logging.properti 4215 tomcat 20 0 126g 24g 21g S 1617.8 19.3 14:21.11 0 6031 /opt/java/bin/java -Djava.util.logging.config.file=/opt/tomcat_instances/solrnode4/conf/logging.properti

Okay, Gil bumping the RAM in jisc3 up to nearer the 160GB of ld01 (n.b. jisc1/2 have 256GB). Also adding SSD for scratch space to jisc1/2/3.

Right, so, the next chunk (-an) went in fine! With 24 reducers, it struggled a little, but worked in the end.

And the -am next one, with 12 reducers, worked totally fine but took a little longer. Initial reducer load was 2,4,2,4 so rather imbalanced.

Running -al with 15 reducers, implying a reducer load of 3,4,3,5, with ld01 having the 5 because it can take it.

The Re-Index¶

Querying at scale¶

So, now we have the 2 billion item index, how to we set it up for queries?

NOTES

Parameters to tweak:
- https://wiki.apache.org/solr/SimpleFacetParameters#facet.method
- Manual query distribution (not clear if that controls the replica/core, but it should I think) https://cwiki.apache.org/confluence/display/solr/Distributed+Requests
empty records, revisits with no visit, have been found. Presumably some kind of id error.
DocValues were not stored on disk, so additional memory overhead.
Apparently, this adds to significant memory overhead per segment, so some say reduce that number (also messes with boot time). Others say single index can be slower due to serial reads. We should probably try to benchmark and tune.
Fields with many values are a major cause of heap exaustion as the faceting requires the fields to be inverted. Building those inverted tables eats a LOT of RAM (asking for content_metadata_ss consumed all 20GB of available RAM).

Query features:

Field collapsing
Proximity search
Random words and phrases (large set, significant # of hits please).
Filtered on facet queries

last_modified_year license_url parse_error 2482 pdf_pdfa_errors 710424 pdf_pdfa_is_valid xml_root_ns 292 in 61582 documents

content_metadata_ss 19,191,627

postcode 1,454,988 postcode_district 18,768

Older points

Software RAID plus MMAP is BAD.
MMAP is pointless when index >> RAM.
Indexing onto HFDS and the deduplication problem.
Try MMAP again, for queries, with less RAM.
Try docValues onDisk.
Try reducing the volume of data (remove expensive but 'uninteresting' fields - hosts are tricky).
Shine testing
Replicating down to two servers.
Dual-configuration testing. (4 v 2 servers while all 4 are present)
PLUS Selective Archive setup.

Analysing JISC5 Performance¶

We run a query sequence on each, with two or four servers, and with distrib=true on automatic or manually distributed over shards. Summaries generated by Gil can now be loaded and plotted.

In [3]:

%matplotlib inline
import numpy as np
from numpy.random import randn
import pandas as pd
from scipy import stats
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_palette("deep", desat=.6)
sns.set_context(rc={"figure.figsize": (8, 4)})
np.random.seed(9221999)

In [125]:

def loadSB( file ):
    ins = open( file, "r" )
    qts = []
    nhs = []
    for line in ins:
        cols = line.split()
        if len(cols) > 1 and cols[1] == "A":
            qts.append(float(cols[2])/1000)
            nhs.append(int(cols[3]))
            qts.append(float(cols[5])/1000)
            nhs.append(int(cols[6]))
            qts.append(float(cols[8])/1000)
            nhs.append(int(cols[9]))
            qts.append(float(cols[11])/1000)
            nhs.append(int(cols[12]))
    ins.close()
    return (np.array(qts), np.array(nhs))

qts2, nhs2 = loadSB( "solr-bench/servers-2-distrib-manual.txt" )
qts2a, nhs2a = loadSB( "solr-bench/servers-2-distrib-auto.txt" )
qts4, nhs4 = loadSB( "solr-bench/servers-4-distrib-manual.txt" )
qts4a, nhs4a = loadSB( "solr-bench/servers-4-distrib-auto.txt" )

fig = plt.figure(figsize=(12, 6))
max_data = 30
bins = np.linspace(0, max_data, 50)
plt.hist(qts2, bins, normed=True, alpha=.5, label="Two Servers");
plt.hist(qts4, bins, normed=True, alpha=.5, label="Four Servers");
plt.legend()
plt.xlabel("Query Time [s]")

fig.savefig("solr-bench/QTimeHist.svg")

In [127]:

fig = plt.figure(figsize=(12, 6))
plt.hist(qts2a, bins, normed=True, alpha=.5, label="Two Servers (auto)");
plt.hist(qts4a, bins, normed=True, alpha=.5, label="Four Servers (auto)");
plt.legend()

Out[127]:

<matplotlib.legend.Legend at 0x11d146da0>

In [128]:

fig = plt.figure(figsize=(12, 6))
plt.hist(qts4, bins, normed=True, alpha=.5, label="Four Servers");
plt.hist(qts4a, bins, normed=True, alpha=.5, label="Four Servers (auto)");
plt.legend()

Out[128]:

<matplotlib.legend.Legend at 0x11ecf68d0>

In [129]:

fig = plt.figure(figsize=(12, 6))
plt.hist(qts2, bins, normed=True, alpha=.5, label="Two Servers");
plt.hist(qts2a, bins, normed=True, alpha=.5, label="Two Servers (auto)");
plt.legend()

Out[129]:

<matplotlib.legend.Legend at 0x120ee1f60>

In [120]:

jp = sns.jointplot(qts2, nhs2, kind="scatter", size=10, xlim=(0,250), ylim=(0,2.5e9))
jp.ax_joint.set_xlabel('Query time (two servers) [s]')
jp.ax_joint.set_ylabel('Number of results')
jp.fig.savefig("solr-bench/QTvN-2.svg")

In [121]:

jp4 = sns.jointplot(qts4, nhs4, kind="scatter", size=10, xlim=(0,250), ylim=(0,2.5e9))
jp4.ax_joint.set_xlabel('Query time (four servers) [s]')
jp4.ax_joint.set_ylabel('Number of results')
jp4.fig.savefig("solr-bench/QTvN-4.svg")

In [146]:

data = [qts2a,qts2,qts4a,qts4]
fig = plt.figure(figsize=(10, 10))
plt.ylabel("Query time [s]")
sns.set_context(rc={"figure.figsize": (10, 10)})
ax = sns.boxplot(data, names=["Two Servers (auto)", "Two Servers", "Four Servers (auto)", "Four Servers"], 
                 color="pastel", alpha=0.75, fliersize=0.0 );
ax.semilogy()
plt.scatter(np.repeat(1, qts2a.shape[0]),qts2a)
plt.scatter(np.repeat(2, qts2.shape[0]),qts2)
plt.scatter(np.repeat(3, qts4a.shape[0]),qts4a)
plt.scatter(np.repeat(4, qts4.shape[0]),qts4)
fig.savefig("solr-bench/QTimeBoxPlot.svg")

Faceting so far has been 20 threads. Currently collecting data for...

no-faceting for
- individual servers
- 2M and
- 4M server configurations

In [6]:

import re

def loadSB2( file ):
    ins = open( file, "r" )
    qts = []
    nhs = []
    for line in ins:
        line = re.sub("FACETS-.+\.QTime","FACETS-QTime",line)
        cols = line.split()
        if len(cols) > 1 and re.match(".*QTime.*",cols[0]) and not re.match(".*-ALL.*",cols[0]):
            qt = float(cols[1])/1000.0
            nf = int(cols[3])
            wc = float(cols[5])/1000.0
            qts.append(qt)
            nhs.append(nf)
    ins.close()
    return (np.array(qts), np.array(nhs))

qtnf181, nfnf181 = loadSB2("solr-bench/testv3-no-facets-181-only.log")
qtnf182, nfnf182 = loadSB2("solr-bench/testv3-no-facets-182-only.log")
qtnf203, nfnf203 = loadSB2("solr-bench/testv3-no-facets-203-only.log")
qtnf215, nfnf215 = loadSB2("solr-bench/testv3-no-facets-215-only.log")
qtnf2m, nfnf2m = loadSB2("solr-bench/testv3-no-facets-2servers.log")
qtnf2m2, nfnf2m2 = loadSB2("solr-bench/testv3-no-facets-2servers-2.log")
qtnf4m, nfnf4m = loadSB2("solr-bench/testv3-no-facets-4servers.log")
qtaf4m, nfaf4m = loadSB2("solr-bench/testv3-all-facets-4servers-100threads.log")
qtaf2m_fcs, nfaf2m_fcs = loadSB2("solr-bench/testv3-all-facets-2servers-fcs.log")


fig = plt.figure(figsize=(12, 6))
plt.scatter(np.repeat(1, qtnf181.shape[0]), qtnf181, label="181");
plt.scatter(np.repeat(2, qtnf182.shape[0]), qtnf182, label="182");
plt.scatter(np.repeat(3, qtnf203.shape[0]), qtnf203, label="203");
plt.scatter(np.repeat(4, qtnf215.shape[0]), qtnf215, label="215");
plt.scatter(np.repeat(5, qtnf2m.shape[0]), qtnf2m, label="2m", color="red");
plt.scatter(np.repeat(5.1, qtnf2m2.shape[0]), qtnf2m2, label="2m2", color="red");
plt.scatter(np.repeat(6, qtnf4m.shape[0]), qtnf4m, label="4m", color="red");
plt.scatter(np.repeat(7, qtaf4m.shape[0]), qtaf4m, label="AF-4m", color="green");
plt.scatter(np.repeat(8, qtaf2m_fcs.shape[0]), qtaf2m_fcs, label="AF-2m-FCS", color="green");
plt.semilogy()

Out[6]:

[]

In [33]:

qtaf181_1s_fcs, nfaf181_1s_fcs = loadSB2("solr-bench/testv3-all-facets-181-fcs.log")
qtaf181_1s, nfaf181_1s = loadSB2("solr-bench/testv3-all-facets-181-1shard.log")
qtaf181_2s, nfaf181_2s = loadSB2("solr-bench/testv3-all-facets-181-2s.log")
qtaf181_3s, nfaf181_3s = loadSB2("solr-bench/testv3-all-facets-181-3shards.log")
qtaf181_4s, nfaf181_4s = loadSB2("solr-bench/testv3-all-facets-181-4s-2.log")
qtaf181_6s, nfaf181_6s = loadSB2("solr-bench/testv3-all-facets-181-6shards.log")
qtaf182_6s, nfaf182_6s = loadSB2("solr-bench/testv3-all-facets-182-6shards.log")
qtaf182_4s, nfaf182_4s = loadSB2("solr-bench/testv3-all-facets-182-4s.log")
qtaf182_3s, nfaf182_3s = loadSB2("solr-bench/testv3-all-facets-182-3shards.log")
qtaf203_6s, nfaf203_6s = loadSB2("solr-bench/testv3-all-facets-203-6shards.log")
qtaf215_1x1, nfaf215_1x1 = loadSB2("solr-bench/testv3-all-facets-215-1by1.log")
qtaf215_6s, nfaf215_6s = loadSB2("solr-bench/testv3-all-facets-215-6shards.log")

qtaf181_6s2, nfaf181_6s2 = loadSB2("solr-bench/testv3-all-facets-181-6s-2.log")
qtaf182_6s2, nfaf182_6s2 = loadSB2("solr-bench/testv3-all-facets-182-6s-2.log")
qtaf203_6s2, nfaf203_6s2 = loadSB2("solr-bench/testv3-all-facets-203-6s-2.log")
qtaf215_6s2, nfaf215_6s2 = loadSB2("solr-bench/testv3-all-facets-215-6s-2.log")

fig = plt.figure(figsize=(12, 6))
plt.scatter(np.repeat(1, qtnf181.shape[0]), qtnf181, label="181");
plt.scatter(np.repeat(3, qtnf4m.shape[0]), qtnf4m, label="4m", color="red");
plt.scatter(np.repeat(4, qtaf4m.shape[0]), qtaf4m, label="AF-4m", color="green");
plt.scatter(np.repeat(5, qtaf2m_fcs.shape[0]), qtaf2m_fcs, label="AF-2m-FCS", color="green");

plt.scatter(np.repeat(6, qtaf181_1s_fcs.shape[0]), qtaf181_1s_fcs, label="AF-181-1s-FCS", color="blue");
plt.scatter(np.repeat(6.5, qtaf181_1s.shape[0]), qtaf181_1s, label="AF-181-1s", color="blue");
plt.scatter(np.repeat(7, qtaf181_2s.shape[0]), qtaf181_2s, label="AF-181-2s", color="blue");
plt.scatter(np.repeat(7.5, qtaf181_3s.shape[0]), qtaf181_3s, label="AF-181-3s", color="blue");
plt.scatter(np.repeat(8, qtaf181_4s.shape[0]), qtaf181_4s, label="AF-181-4s", color="blue");
plt.scatter(np.repeat(8.5, qtaf181_6s.shape[0]), qtaf181_6s, label="AF-181-6s", color="blue");

plt.scatter(np.repeat(9, qtaf182_3s.shape[0]), qtaf182_3s, label="AF-182-3s", color="orange");
plt.scatter(np.repeat(9.5, qtaf182_4s.shape[0]), qtaf182_4s, label="AF-182-4s", color="orange");
plt.scatter(np.repeat(10, qtaf182_6s.shape[0]), qtaf182_6s, label="AF-182-6s", color="orange");

plt.scatter(np.repeat(12, qtaf203_6s.shape[0]), qtaf203_6s, label="AF-203-6s", color="red");

plt.scatter(np.repeat(13, qtaf215_1x1.shape[0]), qtaf215_1x1, label="AF-215-1by1", color="purple");
plt.scatter(np.repeat(14, qtaf215_6s.shape[0]), qtaf215_6s, label="AF-215-6s-1zero", color="purple");
plt.scatter(np.repeat(15, qtaf215_6s2.shape[0]), qtaf215_6s2, label="AF-215-6s2", color="purple");

#plt.semilogy()

Out[33]:

<matplotlib.collections.PathCollection at 0x10f91e748>

In [31]:

fig = plt.figure(figsize=(12, 6))
plt.xlim([1,1e10])
plt.semilogx()
plt.semilogy()
plt.scatter(nfnf4m, qtnf4m, label="NF-4m", color="green");
plt.scatter(nfnf2m, qtnf2m, label="NF-2m", color="red");
plt.legend()

Out[31]:

<matplotlib.legend.Legend at 0x10e0c9b38>

In [36]:

fig = plt.figure(figsize=(12, 6))
plt.xlim([1,1e10])
plt.semilogx()
plt.semilogy()
plt.scatter(nfaf181_6s, qtaf181_6s, label="181 6 shards", color="red");
plt.scatter(nfaf182_6s, qtaf182_6s, label="182 6 shards", color="blue");
plt.scatter(nfaf203_6s, qtaf203_6s, label="203 6 shards", color="green");
plt.scatter(nfaf215_6s2, qtaf215_6s2, label="215 6 shards", color="lightgreen");
plt.scatter(nfaf4m, qtaf4m, label="4 Servers", color="purple");
plt.legend()

Out[36]:

<matplotlib.legend.Legend at 0x10f270898>

all-facets for
- 4M servers, with more threads.
- 4M servers, with no threads specified.
- 4M servers, enum method
- 4M servers, fcs method
- As above, but against a single shard (distrib-false).

In [ ]: